Take Home_Ex03

Author

Dabbie Neo

Published

June 3, 2023

Modified

June 17, 2023

1. Background

FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing-related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph that includes information about companies, owners, workers, and financial status. FishEye aims to use this graph to identify anomalies that could indicate a company is involved in IUU.

With reference to Mini-Challenge 3 of VAST Challenge 2023 and by using appropriate static and interactive statistical graphics methods, we will be helping FishEye to better understand fishing business anomalies.

2. Data Source

The data is taken from the Mini-Challenge 3 of VAST Challenge 2023.

3. Data Preparation

3.1 Installing and launching R packages

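The code chunk below uses p_load() of the pacman package to check whether the required packages are installed. Missing packages are installed first, and then all of them are loaded into R. DT is not in the list because it is called later with the DT:: prefix, so it only needs to be installed, not attached. The packages loaded are: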

pacman::p_load(jsonlite, tidygraph, ggraph, 
               visNetwork, graphlayouts, ggforce, 
               skimr, tidytext, tidyverse, patchwork, ggiraph)

3.2 Loading the Data

fromJSON() of jsonlite package is used to import MC3.json into R environment.

mc3_data <- fromJSON("data/MC3.json")

The output, mc3_data, is a large list object.

3.3 Extracting edges

The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.

mc3_edges <- as_tibble(mc3_data$links) %>% 
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
    summarise(weights = n()) %>%
  filter(source!=target) %>%
  ungroup()
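
A minimal sketch on made-up data (the toy_links tibble and the names w_distinct and w_counted are hypothetical, not part of MC3) of how the order of distinct() and the counting step affects weights. Because distinct() removes duplicate links before they are counted, every remaining link gets a weight of 1. Counting without distinct() is one possible alternative, assuming repeated links are meant to carry a higher weight.

```r
library(dplyr)
library(tibble)

# Hypothetical toy links: the first link appears twice
toy_links <- tibble(
  source = c("A Co", "A Co", "B Co"),
  target = c("Ann", "Ann", "Bob"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Company Contacts")
)

# distinct() first: each remaining link is counted exactly once
w_distinct <- toy_links %>%
  distinct() %>%
  count(source, target, type, name = "weights")

# No distinct(): repeated links are reflected in the weight
w_counted <- toy_links %>%
  count(source, target, type, name = "weights")

w_distinct$weights
w_counted$weights
```

Which variant is appropriate depends on whether repeated links in the raw data are meaningful or simply duplicated records.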

3.4 Extracting nodes

The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.

mc3_nodes <- as_tibble(mc3_data$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services) #select() used to organise the sequence of col

3.5 Initial Data Exploration

3.5.1 Exploring the edges data frame

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.

skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

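The report above reveals that there are no missing values in any field. Note that weights equals 1 for every row, because distinct() removes duplicate links before they are counted.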

In the code chunk below, datatable() of the DT package is used to display the mc3_edges tibble data frame as an interactive table in the HTML document.

DT::datatable(mc3_edges)

Now, we will plot the distribution of the relationship types that exist between source and target, together with their frequencies.

#| echo: false
#| fig-width: 3
#| fig-height: 3

hist_type <- ggplot(data = mc3_edges,
       aes(x = type)) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) +  # after_stat() replaces the deprecated ..count..
  labs(title = "Distribution of Relationship Types", x = "Type", y = "Count") +
  theme(plot.title = element_text(face = "bold"))

# hist_type

There are two relationship types: Beneficial Owner, with 16,792 edges, and Company Contacts, with 7,244 edges.

Next, we will explore how many companies an owner usually owns. Owners who own noticeably more companies than the norm may be flagged as suspicious, and we can focus further investigation on them.

To begin, we keep only the edges with type == “Beneficial Owner” and count the number of companies per owner, as shown in the code chunk below.

#| echo: false
#| fig-width: 3
#| fig-height: 3

mc3_edges_owner <- mc3_edges %>%
  filter(type == "Beneficial Owner") %>% 
  group_by(target, type) %>%
    summarise(no_of_companies = n()) %>%
  ungroup()
mc3_edges_owner
# A tibble: 15,305 × 3
   target        type             no_of_companies
   <chr>         <chr>                      <int>
 1 Aaron Adams   Beneficial Owner               1
 2 Aaron Adkins  Beneficial Owner               1
 3 Aaron Allen   Beneficial Owner               1
 4 Aaron Alvarez Beneficial Owner               1
 5 Aaron Baker   Beneficial Owner               1
 6 Aaron Beasley Beneficial Owner               1
 7 Aaron Berry   Beneficial Owner               1
 8 Aaron Black   Beneficial Owner               1
 9 Aaron Boyle   Beneficial Owner               1
10 Aaron Carroll Beneficial Owner               1
# ℹ 15,295 more rows

We can also plot the distribution of the number of companies each beneficial owner owns using ggplot2.

#| echo: false
#| fig-width: 3
#| fig-height: 3

# Create a ggplot histogram
gg_hist_own <- ggplot(mc3_edges_owner, aes(x = no_of_companies)) +
  geom_histogram(binwidth = 1) +  # one bin per integer count so the bars line up with the labels
  labs(title = "No of companies beneficial owners own", x = "No of companies", y = "Count") +
  theme(plot.title = element_text(face = "bold")) +
  scale_x_continuous(breaks = seq(min(mc3_edges_owner$no_of_companies), max(mc3_edges_owner$no_of_companies), by = 1))

# Calculate frequency counts for each bin
freq_counts <- table(mc3_edges_owner$no_of_companies)

# Create a data frame for labels
label_data <- data.frame(x = as.numeric(names(freq_counts)), y = as.numeric(freq_counts))

# Add frequency labels to the plot
gg_hist_own <- gg_hist_own +
  geom_text(
    data = label_data,
    aes(x = x, y = y, label = y),
    vjust = -0.5,
    size = 3
  )

# Display the ggplot histogram
# gg_hist_own

We can combine the plot of the distribution of the type of relationship and the distribution of companies beneficial owners own using patchwork.

#| echo: false
#| fig-width: 3
#| fig-height: 3
combined_plot <- hist_type / gg_hist_own
combined_plot

As we can see above, only a small percentage (<0.5%) of beneficial owners own more than 3 companies. These owners will be flagged as suspicious, and we will investigate them further.
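
The share of flagged owners can be computed directly rather than read off the plot. The sketch below uses a small made-up tibble (toy_owner is hypothetical and only mimics the shape of mc3_edges_owner), so the resulting proportion says nothing about the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical owner counts standing in for mc3_edges_owner
toy_owner <- tibble(
  target = c("Ann", "Bob", "Cat", "Dan", "Eve"),
  no_of_companies = c(1L, 1L, 2L, 4L, 1L)
)

# Proportion of owners with more than 3 companies
share_flagged <- mean(toy_owner$no_of_companies > 3)

# The flagged owners themselves
flagged <- toy_owner %>%
  filter(no_of_companies > 3)

share_flagged
flagged
```

Applying the same two steps to mc3_edges_owner would give the actual percentage and the list of owners to investigate.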

Next, I will create a new edges data frame called mc3_edges_with_no_of_companies, which adds the no_of_companies column to mc3_edges.

#| echo: false
#| fig-width: 3
#| fig-height: 3

# Join the no_of_companies column from mc3_edges_owner into mc3_edges
mc3_edges_with_no_of_companies <- mc3_edges %>%
  left_join(mc3_edges_owner %>% select(target, no_of_companies),
            by = "target") %>%
  mutate(no_of_companies = ifelse(is.na(no_of_companies), 0, no_of_companies))


# View the updated mc3_edges
mc3_edges_with_no_of_companies
# A tibble: 24,036 × 5
   source                      target             type   weights no_of_companies
   <chr>                       <chr>              <chr>    <int>           <dbl>
 1 1 AS Marine sanctuary       Christina Taylor   Compa…       1               1
 2 1 AS Marine sanctuary       Debbie Sanders     Benef…       1               1
 3 1 Ltd. Liability Co Cargo   Angela Smith       Benef…       1               1
 4 1 S.A. de C.V.              Catherine Cox      Compa…       1               0
 5 1 and Sagl Forwading        Angela Mendoza     Compa…       1               0
 6 1 and Sagl Forwading        Christopher Watson Benef…       1               1
 7 2 Limited Liability Company Amanda Mcdonald    Benef…       1               1
 8 2 Limited Liability Company Megan Padilla      Compa…       1               0
 9 2 Limited Liability Company Monica Martinez    Compa…       1               0
10 2 Limited Liability Company Teresa Collins     Benef…       1               1
# ℹ 24,026 more rows

3.5.2 Exploring the nodes data frame

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.

skim(mc3_nodes)
Data summary
Name mc3_nodes
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 4 1737 0 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 21515 0.22 1822155 18184433 3652.23 7676.36 16210.68 48327.66 310612303 ▇▁▁▁▁

In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.

DT::datatable(mc3_nodes)

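In the product_services column, the placeholder value “character(0)” is replaced with “unknown”. In the revenue_omu column, missing (NA) values are replaced with “0”. Note that this replacement turns revenue_omu into a character column, so it is converted back to numeric later before any calculation.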

#| echo: false
#| fig-width: 3
#| fig-height: 3

mc3_nodes <- mc3_nodes %>%
  mutate(product_services = ifelse(product_services == "character(0)", "unknown", product_services),
         revenue_omu = ifelse(revenue_omu == "" | is.na(revenue_omu), "0", revenue_omu))
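
The type change comes from ifelse() itself: when the replacement value is a character string, the whole result is coerced to character. The sketch below shows this on a made-up vector (revenue, rev_chr and rev_num are hypothetical names) and contrasts it with dplyr::coalesce(), which is one way to keep the column numeric. It is an illustration, not a change to the pipeline above.

```r
library(dplyr)

# Hypothetical revenue values with one missing entry
revenue <- c(1500.5, NA, 320)

# ifelse() with a character replacement coerces the result to character
rev_chr <- ifelse(is.na(revenue), "0", revenue)
class(rev_chr)

# coalesce() with a numeric replacement keeps the vector numeric
rev_num <- coalesce(revenue, 0)
class(rev_num)
```

Whether 0 is a sensible stand-in for an unknown revenue is a separate question, since it also affects any later summary statistics.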

Distribution of the type of nodes

#| echo: false
#| fig-width: 3
#| fig-height: 3

hist_type_node <- ggplot(data = mc3_nodes,
       aes(x = type)) +
  geom_bar()+
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) +  # after_stat() replaces the deprecated ..count..
  labs(title = "Distribution of Node Type", x = "Type", y = "Count") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold")) 
  
#hist_type_node

Distribution of number of countries for each id

#| echo: false
#| fig-width: 3
#| fig-height: 3

# Count the number of unique countries for each ID
country_counts <- mc3_nodes %>%
  group_by(id) %>%
  summarize(unique_countries = n_distinct(country))

# Tabulate how many ids have 1, 2, 3, ... unique countries
frequency_table_country <- table(country_counts$unique_countries)

# Convert the frequency table to a data frame
frequency_df_country <- as.data.frame(frequency_table_country)

# Rename the columns
colnames(frequency_df_country) <- c("Unique Countries", "Frequency")

# Display the frequency table
frequency_df_country
  Unique Countries Frequency
1                1     22783
2                2       131
3                3        12
4                4         2
5                9         1
#| echo: false
#| fig-width: 3
#| fig-height: 3

# Plot the frequency table as a bar plot with labels
hist_country <- ggplot(frequency_df_country, aes(x = `Unique Countries`, y = Frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Frequency), vjust = -0.5, size = 3.5) +  # Add labels to the bars
  labs(title = "Count of Countries for each ID",
       x = "No of Countries",
       y = "Count") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold"))
#hist_country

From the frequency table above, 146 ids are associated with more than one country, which makes them candidates for closer inspection.
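
To see which ids these are, the per-id summary can be filtered directly. The sketch below runs on a made-up nodes tibble (toy_nodes and multi_country are hypothetical names), assuming the same id and country columns as mc3_nodes.

```r
library(dplyr)
library(tibble)

# Hypothetical nodes: "A Co" is listed under two countries
toy_nodes <- tibble(
  id      = c("A Co", "A Co", "B Co", "C Co", "C Co"),
  country = c("ZH", "Oceanus", "ZH", "ZH", "ZH")
)

# Keep only ids that appear under more than one country,
# and list those countries for inspection
multi_country <- toy_nodes %>%
  group_by(id) %>%
  summarise(unique_countries = n_distinct(country),
            countries = paste(sort(unique(country)), collapse = ", ")) %>%
  filter(unique_countries > 1)

multi_country
```

Note that an id shared by several rows may also be a common personal name rather than a single entity, so a multi-country id is a prompt for inspection, not proof of an anomaly.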

#| echo: false
#| fig-width: 3
#| fig-height: 3

# Count the number of unique rev for each ID
rev_counts <- mc3_nodes %>%
  group_by(id) %>%
  summarize(unique_rv = n_distinct(revenue_omu))

# Display the resulting data frame
#rev_counts

# Tabulate how many ids have 1, 2, 3, ... unique revenue values
frequency_table_rev <- table(rev_counts$unique_rv)

# Convert the frequency table to a data frame
frequency_df_rev <- as.data.frame(frequency_table_rev)

# Rename the columns
colnames(frequency_df_rev) <- c("Unique rev", "Frequency")

# Display the frequency table
frequency_df_rev
  Unique rev Frequency
1          1     22238
2          2       591
3          3        76
4          4        14
5          5         4
6          6         2
7          7         2
8         10         1
9         11         1
#| echo: false
#| fig-width: 3
#| fig-height: 3

# Plot the frequency table as a bar plot with labels
hist_rev <- ggplot(frequency_df_rev, aes(x = `Unique rev`, y = Frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Frequency), vjust = -0.5, size = 3.5) +  # Add labels to the bars
  labs(title = "Count of no of rev for each ID",
       x = "No of rev",
       y = "Count") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold"))
#hist_rev

From the table above, we can also see that 691 ids have more than one distinct revenue value recorded.

The three plots are combined using patchwork, as shown in the code chunk below.

#| echo: false
#| fig-width: 4
#| fig-height: 4

combine_plot_node <- hist_type_node / (hist_country + hist_rev)
combine_plot_node

Now, I will create a new nodes data frame called mc3_nodes_updated that stores the country and revenue counts derived earlier, so that we can see which ids they belong to.

#| echo: false
#| fig-width: 3
#| fig-height: 3

# Join the unique_countries column from country_counts into mc3_nodes
mc3_nodes_updated <- mc3_nodes %>%
  left_join(country_counts %>% select(id, unique_countries),
            by = c("id" = "id")) 

# Join the unique_rv column from rev_counts into mc3_nodes
mc3_nodes_updated <- mc3_nodes_updated %>%
  left_join(rev_counts %>% select(id, unique_rv),
            by = c("id" = "id"))

# View the updated mc3_nodes
mc3_nodes_updated
# A tibble: 27,622 × 7
   id      country type  revenue_omu product_services unique_countries unique_rv
   <chr>   <chr>   <chr> <chr>       <chr>                       <int>     <int>
 1 Jones … ZH      Comp… 310612303.… Automobiles                     1         2
 2 Colema… ZH      Comp… 162734683.… Passenger cars,…                1         1
 3 Aqua A… Oceanus Comp… 115004666.… Holding firm wh…                1         1
 4 Makumb… Utopor… Comp… 90986412.5… Car service, ca…                1         1
 5 Taylor… ZH      Comp… 81466666.6… Fully electric …                1         1
 6 Harmon… ZH      Comp… 75070434.9… Discount superm…                1         1
 7 Punjab… Riodel… Comp… 72167572.0… Beef, pork, chi…                1         1
 8 Assam … Utopor… Comp… 72162317.2… Power and Gas s…                2         2
 9 Ianira… Rio Is… Comp… 68832979.2… Light commercia…                1         1
10 Moran,… ZH      Comp… 65592905.5… Automobiles, tr…                1         1
# ℹ 27,612 more rows

3.5.3 Initial Network Visualisation and Analysis

Building network model with tidygraph

filtered_mc3_edges_owner <- mc3_edges_with_no_of_companies %>%
  filter(no_of_companies > 3, type == "Beneficial Owner")
filtered_mc3_edges_owner
# A tibble: 313 × 5
   source                           target         type  weights no_of_companies
   <chr>                            <chr>          <chr>   <int>           <dbl>
 1 Acevedo, Dickson and Gonzalez    Richard Smith  Bene…       1               6
 2 Adams Group                      John Smith     Bene…       1               9
 3 Adams-Pope                       Michelle Rodr… Bene…       1               4
 4 Adriatic Catch S.A. de C.V.      David Jones    Bene…       1               6
 5 Albertine Rift  NV Family        Michael Taylor Bene…       1               4
 6 Alexander PLC                    David Jones    Bene…       1               6
 7 Alvarez Ltd                      Michael Carter Bene…       1               5
 8 Alvarez, Young and Ramos         Michael Miller Bene…       1               5
 9 Ancla del Este Ltd. Liability Co Aaron Jones    Bene…       1               4
10 Ancla del Este Sp Fish           John Jones     Bene…       1               4
# ℹ 303 more rows
#| echo: false
#| fig-width: 3
#| fig-height: 3

# Create a data frame with source nodes and rename column
id1 <- filtered_mc3_edges_owner %>%
  select(source) %>%
  rename(id = source) %>%
  mutate(type_node = "company")

# Create a data frame with target nodes and rename column
id2 <- filtered_mc3_edges_owner %>%
  select(target, type) %>%
  rename(id = target, type_node = type)

# Combine the two data frames and remove duplicates
mc3_nodes1 <- rbind(id1, id2) %>%
  distinct() 

# Inspect the combined node list
mc3_nodes1
# A tibble: 362 × 2
   id                               type_node
   <chr>                            <chr>    
 1 Acevedo, Dickson and Gonzalez    company  
 2 Adams Group                      company  
 3 Adams-Pope                       company  
 4 Adriatic Catch S.A. de C.V.      company  
 5 Albertine Rift  NV Family        company  
 6 Alexander PLC                    company  
 7 Alvarez Ltd                      company  
 8 Alvarez, Young and Ramos         company  
 9 Ancla del Este Ltd. Liability Co company  
10 Ancla del Este Sp Fish           company  
# ℹ 352 more rows
DT::datatable(mc3_nodes1)
mc3_graph <- tbl_graph(nodes = mc3_nodes1,
                       edges = filtered_mc3_edges_owner,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
#| echo: false
#| fig-width: 4
#| fig-height: 4

# Set a seed for reproducibility
set.seed(123)

mc3_graph %>%
ggraph(layout = "fr") +
  # Constant alpha and colour are set outside aes() so they are not mapped to legends
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range=c(1,10))+
  theme_graph()
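
Node size in the plot above is driven by betweenness centrality. As a rough illustration of why an owner linked to several companies stands out, the sketch below builds a tiny made-up star graph (toy_nodes, toy_edges and toy_graph are hypothetical) with one owner connected to three companies and computes the same measure with tidygraph.

```r
library(tidygraph)
library(dplyr)
library(tibble)

# Hypothetical star: one owner linked to three companies
toy_nodes <- tibble(id = c("Co 1", "Co 2", "Co 3", "Owner X"))
toy_edges <- tibble(from = c(1L, 2L, 3L), to = c(4L, 4L, 4L))

toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness())

# Extract the node table with the computed centrality
bc <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()

bc
```

Every shortest path between two companies passes through the owner, so the owner receives all of the betweenness while the companies receive none.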

Preparing Network Data for visNetwork

Instead of a static network graph, we can plot an interactive network graph using the visNetwork package. Before doing so, we need to prepare two tibble data frames: one for the nodes and one for the edges.

Preparing edges tibble data frame

edges_df <- mc3_graph %>%
  activate(edges) %>%
  as_tibble()
edges_df
# A tibble: 313 × 5
    from    to type             weights no_of_companies
   <int> <int> <chr>              <int>           <dbl>
 1     1   296 Beneficial Owner       1               6
 2     2   297 Beneficial Owner       1               9
 3     3   298 Beneficial Owner       1               4
 4     4   299 Beneficial Owner       1               6
 5     5   300 Beneficial Owner       1               4
 6     6   299 Beneficial Owner       1               6
 7     7   301 Beneficial Owner       1               5
 8     8   302 Beneficial Owner       1               5
 9     9   303 Beneficial Owner       1               4
10    10   304 Beneficial Owner       1               4
# ℹ 303 more rows

Preparing nodes tibble data frame

In this section, we will prepare a nodes tibble data frame by using the code chunk below.

nodes_df <- mc3_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  # visNetwork expects label and group columns plus a numeric id
  rename(label = id, group = type_node) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = label)
#| echo: false
#| fig-width: 4
#| fig-height: 4

# Plot the network graph with labeled nodes using visNetwork
visNetwork(nodes_df, edges_df, main = list(text = "Network Graph of Company and Beneficial Owner",
                                           style = "color: black; font-weight: bold; text-align: center;")) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visLayout(randomSeed = 123) %>%
  addFontAwesome(name ="font-awesome") %>%
  visGroups(groupname = "company", shape = "icon",
            icon = list(code = "f0f7", color = "#000000")) %>%
  visGroups(groupname = "Beneficial Owner", shape = "icon",
            icon = list(code = "f2bd")) %>%
  visLegend() %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE
  ) %>%
  visInteraction(
    zoomView = TRUE,
    dragNodes = TRUE,
    dragView = TRUE,
    navigationButtons = TRUE,
    selectable = TRUE,  # Enable node selection
    hover = TRUE  # Enable hover effects
  )

Similarly, to plot the network graph of companies and company contacts, we repeat the steps above.

#Filter the type = "Company Contacts"
mc3_edges_cc<- mc3_edges_with_no_of_companies %>%
  filter(no_of_companies > 3, type == "Company Contacts") 
mc3_edges_cc
# A tibble: 72 × 5
   source                                   target type  weights no_of_companies
   <chr>                                    <chr>  <chr>   <int>           <dbl>
 1 Adriatic Tuna GmbH & Co. KG              Chris… Comp…       1               4
 2 Alvarez and Sons                         Rober… Comp…       1               4
 3 Andhra Pradesh   Limited Liability Comp… Miche… Comp…       1               4
 4 Austin-Porter                            Micha… Comp…       1               4
 5 Bahía del Este Ges.m.b.H.                Micha… Comp…       1               4
 6 Baker-Savage                             Melis… Comp…       1               4
 7 Brown-Frank                              John … Comp…       1               9
 8 Caracola del Este Sagl Solutions         Micha… Comp…       1               5
 9 Clayton Ltd                              Brian… Comp…       1               5
10 Coleman, Harris and Mitchell             John … Comp…       1               7
# ℹ 62 more rows
#| echo: false
#| fig-width: 3
#| fig-height: 3

# Create a data frame with source nodes and rename column
id3 <- mc3_edges_cc %>%
  select(source) %>%
  rename(id = source) %>%
  mutate(type_node = "company")

# Create a data frame with target nodes and rename column
id4 <- mc3_edges_cc %>%
  select(target, type) %>%
  rename(id = target, type_node = type)

# Combine the two data frames and remove duplicates
mc3_nodes2 <- rbind(id3, id4) %>%
  distinct()

# Build the graph and compute centrality measures
mc3_graph2 <- tbl_graph(nodes = mc3_nodes2,
                       edges = mc3_edges_cc,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
#| echo: false
#| fig-width: 4
#| fig-height: 4

# Set a seed for reproducibility
set.seed(123)

mc3_graph2 %>%
ggraph(layout = "fr") +
  # Constant alpha and colour are set outside aes() so they are not mapped to legends
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range=c(1,10))+
  theme_graph()

edges_df_2 <- mc3_graph2 %>%
  activate(edges) %>%
  as_tibble()
nodes_df_2 <- mc3_graph2 %>%
  activate(nodes) %>%
  as_tibble() %>%
  # visNetwork expects label and group columns plus a numeric id
  rename(label = id, group = type_node) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = label)
#| echo: false
#| fig-width: 4
#| fig-height: 4

# Plot the network graph with labeled nodes using visNetwork
visNetwork(nodes_df_2, edges_df_2, main = list(text = "Network Graph of Company and Company Contacts",
                                           style = "color: black; font-weight: bold; text-align: center;")) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visLayout(randomSeed = 123) %>%
  addFontAwesome(name ="font-awesome") %>%
  visGroups(groupname = "company", shape = "icon",
            icon = list(code = "f0f7", color = "#000000")) %>%
  visGroups(groupname = "Company Contacts", shape = "icon",
            icon = list(code = "f0c0")) %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE
  ) %>%
  visLegend() %>%
  visInteraction(
    zoomView = TRUE,
    dragNodes = TRUE,
    dragView = TRUE,
    navigationButtons = TRUE,
    selectable = TRUE,  # Enable node selection
    hover = TRUE  # Enable hover effects
  )

Top 10% revenue

filtered_mc3_edges <- mc3_edges_with_no_of_companies %>%
  filter(no_of_companies > 3)
#| echo: false
#| fig-width: 3
#| fig-height: 3

# Create a data frame with source nodes and rename column
id4 <- filtered_mc3_edges %>%
  select(source) %>%
  rename(id = source) %>%
  mutate(type_node = "company")

# Create a data frame with target nodes and rename column
id5 <- filtered_mc3_edges %>%
  select(target, type) %>%
  rename(id = target, type_node = type)

# Combine the two data frames and remove duplicates
mc3_nodes3 <- rbind(id4, id5) %>%
  distinct() %>%
  left_join(mc3_nodes_updated,
            by = "id") %>%  # id is the only shared column, so the join key is made explicit
  distinct()

mc3_nodes3 <- mc3_nodes3 %>%
  mutate(revenue_omu = ifelse(revenue_omu == "" | is.na(revenue_omu), "0", revenue_omu))

# Inspect the node list with the joined attributes
mc3_nodes3
# A tibble: 535 × 8
   id      type_node country type  revenue_omu product_services unique_countries
   <chr>   <chr>     <chr>   <chr> <chr>       <chr>                       <int>
 1 Aceved… company   <NA>    <NA>  0           <NA>                           NA
 2 Adams … company   ZH      Comp… 9056.2418   A range of fish…                1
 3 Adams … company   ZH      Bene… 0           unknown                         1
 4 Adams … company   ZH      Comp… 0           unknown                         1
 5 Adams-… company   <NA>    <NA>  0           <NA>                           NA
 6 Adriat… company   Puerto… Comp… 8869.44     Technical testi…                1
 7 Adriat… company   Oceanus Comp… 29366.6728  Integrated frei…                1
 8 Albert… company   Marebak Comp… 9760.8727   Alaska Pollock,…                1
 9 Alexan… company   ZH      Bene… 0           unknown                         1
10 Alvare… company   ZH      Bene… 0           unknown                         1
# ℹ 525 more rows
# ℹ 1 more variable: unique_rv <int>
#| echo: false
#| fig-width: 3
#| fig-height: 3

# Convert the revenue column to numeric (if it's not already numeric)
mc3_nodes3$revenue_omu <- as.numeric(mc3_nodes3$revenue_omu)

# Calculate the revenue threshold for the top 10% (90th percentile);
# missing revenues were set to 0 above, so they are included in the quantile
revenue_threshold <- quantile(mc3_nodes3$revenue_omu, probs = 0.90, na.rm = TRUE)

# Filter the DataFrame to retain only the rows with revenue above the threshold
filtered_mc3_nodes <- mc3_nodes3[mc3_nodes3$revenue_omu > revenue_threshold, ]

# View the filtered DataFrame
filtered_mc3_nodes
# A tibble: 54 × 8
   id      type_node country type  revenue_omu product_services unique_countries
   <chr>   <chr>     <chr>   <chr>       <dbl> <chr>                       <int>
 1 Ancla … company   Uzifri… Comp…     130212. Operation of fi…                1
 2 Andhra… company   Rio Is… Comp…     787121. Grocery products                1
 3 Bahía … company   Novarc… Comp…      60335. Fabricated meta…                1
 4 Bahía … company   Oceanus Comp…     254667. Swimwear and fa…                2
 5 Bahía … company   Novarc… Comp…      98065. Contract manufa…                3
 6 Bahía … company   Utopor… Comp…      67616. Gelatin                         3
 7 Baker … company   ZH      Comp…  104095830. Fish; fresh or …                1
 8 BlueWa… company   Zawali… Comp…     199596. Canned Products…                1
 9 Bu yu … company   Nalako… Comp…      62860. Gelatine produc…                1
10 Congo … company   Riodel… Comp…     106161. Writing tools a…                1
# ℹ 44 more rows
# ℹ 1 more variable: unique_rv <int>
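
Because missing revenues were replaced with 0 before the quantile is computed, the zero rows pull the 90th-percentile threshold down. The sketch below shows the effect on made-up numbers (revenue, thr_all and thr_pos are hypothetical names). Computing the threshold on positive revenues only is an alternative, not what the analysis above does.

```r
# Hypothetical revenues: six unknowns stored as 0, four known values
revenue <- c(0, 0, 0, 0, 0, 0, 100, 200, 300, 1000)

# Threshold with the zero placeholders included
thr_all <- quantile(revenue, probs = 0.90)

# Threshold computed on known (positive) revenues only
thr_pos <- quantile(revenue[revenue > 0], probs = 0.90)

thr_all
thr_pos

# Rows that would be kept under each threshold
sum(revenue > thr_all)
sum(revenue > thr_pos)
```

With many placeholder zeros, "top 10%" therefore refers to the top 10% of all listed nodes rather than the top 10% of nodes with a known revenue.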
#| echo: false
#| fig-width: 5
#| fig-height: 6

# Create a bar chart of revenue vs ID using ggplot
bar_plot_toprev <- ggplot(filtered_mc3_nodes, aes(x = reorder(id, revenue_omu), y = revenue_omu/1000)) +
  geom_bar_interactive(aes(tooltip = paste("ID:", id,
                                           "<br>Type:", type_node,
                                           "<br>Country:", country,
                                           "<br>Revenue:", revenue_omu,
                                           "<br>Product Services:", product_services)),
                       stat = "identity", fill = "steelblue") +
  labs(x = "id", y = "Revenue_omu ('000)", title = "Top 10% ids") +
  coord_flip() +
  theme(plot.title = element_text(face = "bold"))+
  theme(axis.text.y = element_text(size = 6))

# Print the bar plot
girafe(ggobj = bar_plot_toprev,
       width_svg = 8,
       height_svg = 8*0.618)